Abstract



We develop a new algorithm to cluster sparse unweighted graphs - i.e., partition the nodes into 
disjoint clusters so that the edge density is higher within clusters and lower across clusters. By sparsity 
we mean the setting where both the in-cluster and across-cluster edge densities are very small, possibly 
vanishing in the size of the graph. Sparsity makes the problem noisier, and hence more difficult to solve. 

Any clustering involves a tradeoff between minimizing two kinds of errors: missing edges within 
clusters and present edges across clusters. Our insight is that in the sparse case, these must be penalized 
differently. We analyze our algorithm's performance on the natural, classical and widely studied "planted 
partition" model (also called the stochastic block model); we show that our algorithm can cluster sparser 
graphs, and with smaller clusters, than all previous methods we are aware of. We provide empirical 
results as well. 



1 Introduction 

This paper proposes a new algorithm for the following task: given a sparse undirected unweighted graph, 
partition the nodes into disjoint clusters so that the density of edges within clusters is higher than the density 
of edges across clusters. In particular, we are interested in settings where even within clusters the edge density is 
low, and the density across clusters is lower by only a small multiplicative constant. 

Several large modern datasets and graphs are sparse; examples include the web graph, social graphs 
of various social networks, etc. Clustering naturally arises in these settings as a means/tool for community 
detection, user profiling, link prediction, collaborative filtering etc. More generally, there are several 
clustering applications where one is given as input a set of similarity relationships, but this set is quite 
sparse. Unweighted sparse graph clustering corresponds to a special case in which all similarities are either 
"1" or "0". 

An earlier version of this work appears at the Neural Information Processing Systems Conference (NIPS), 2012. 

As has been well-recognized, sparsity complicates clustering, because it makes the problem noisier. 
Just for intuition, imagine a random graph where every edge has a (potentially different) probability $p_{ij}$ 
(which can be reflective of an underlying clustering structure) of appearing in the graph. Consider now the 
edge random variable, which is 1 if there is an edge, and 0 otherwise. Then, in the sparse graph setting of small 
$p_{ij} \to 0$, the mean of this variable is $p_{ij}$ but its standard deviation is about $\sqrt{p_{ij}}$, which can be much larger. This 
problem gets worse as pij gets smaller. Moreover, sparsity means the difference between in-cluster and 
across-cluster densities is also small, making the clustering structure less significant and harder to detect. 
Another parameter governing problem difficulty is the size of the clusters; smaller clusters are easier to 
lose in the noise. 
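
To make the noise argument concrete, here is a minimal sketch (not from the paper) that computes the exact mean and standard deviation of a Bernoulli edge indicator. The function name `edge_noise_ratio` and the chosen values of p are illustrative assumptions.

```python
import math

def edge_noise_ratio(p):
    """Return (mean, std, std/mean) of a Bernoulli(p) edge indicator."""
    mean = p
    std = math.sqrt(p * (1.0 - p))  # close to sqrt(p) when p is small
    return mean, std, std / mean

# The ratio std/mean grows as the edge probability shrinks.
for p in (0.5, 0.1, 0.01, 0.001):
    mean, std, ratio = edge_noise_ratio(p)
    print(f"p={p:<6} mean={mean:.4f} std={std:.4f} std/mean={ratio:.1f}")
```

The ratio equals $\sqrt{(1-p)/p}$, so it is unbounded as $p \to 0$, which is the sense in which sparser graphs are noisier.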

Our contribution: We propose a new algorithm for sparse unweighted graph clustering. Clearly, 
there will be two kinds of deviations (i.e. errors) between the given graph and any candidate clustering: 
missing edges within clusters, and present edges across clusters. Our key realization is that for sparse 
graph clustering, these two types of error should be penalized differently. Doing so gives us a combinatorial 
optimization problem; our algorithm is a particular convex relaxation of the same, based on the fact 
that the cluster matrix is low-rank (we elaborate below). Our main analytical result in this paper 
is theoretical guarantees on its performance for the classical planted partition model [11], also called the 
stochastic block-model [18, 26], for random clustered graphs. While this model has a rich literature (e.g., 
[H El [m [Mj ), we show that our algorithm outperforms (up to at most log factors) every existing method in 
this setting (i.e. it recovers the true clustering for a bigger range of sparsity and cluster sizes). Both the 
level of sparsity and the number and sizes of the clusters are allowed to be functions of n, the total number 
of nodes. In fact, we show that in a sense we are close to the boundary at which "any" spectral algorithm 
can be expected to work. Our simulation study confirms our theoretical finding that the proposed method 
is effective in clustering sparse graphs and outperforms existing methods. 

The rest of the paper is organized as follows: Section 1.1 provides an overview of related work; Section 2 
presents both the precise algorithm, and the idea behind it; Section 3 presents the main results - analytical 
results on the planted partition / stochastic block model - which are shown to outperform existing methods; 
Section 4 provides simulation results; and finally, the proofs of our theoretical results are given in Sections 5 
and 6. 

1.1 Related Work 

The general field of clustering, or even graph clustering, is too vast for a detailed survey here; we focus on 
the most related threads, and therein too primarily on work which provides theoretical "cluster recovery" 
guarantees on the resulting algorithms. 

Correlation clustering: As mentioned above, every candidate clustering will have two kinds of 
errors; correlation clustering [2] weighs them equally, thus the objective is to find the clustering which 
minimizes just the total number of errors. As we show below, doing so only applies to dense graphs. 
Correlation clustering is an NP-hard problem. Subsequently, there has been much work on devising 
approximation algorithms for both the weighted and unweighted cases |2l [121 El E!. Approximations based 
on LP relaxation [13] and SDP relaxation [Ml [23], followed by rounding, have also been developed. Most 
of this line of work is on worst-case guarantees. We emphasize that while we do convex relaxation as well, 
we do not do rounding; rather, our convex program itself yields an optimal clustering. 

Planted partition model / Stochastic block model: This is a natural and classic model for 
studying graph clustering in the average case, and is also the setting for our performance guarantees. Our 
results are directly comparable to work here; we formally define this setting in Section 3 and present a 
detailed comparison, after some notation and our theorem, in Section 3 below. 

Sparse and low-rank matrix decomposition: It has recently been shown [8, 6] that, under certain 
conditions, it is possible to recover a low-rank matrix from sparse errors of arbitrary magnitude; this has 
even been applied to graph clustering [19]. Our algorithm turns out to be a weighted version of sparse and 
low-rank matrix decomposition, with different elements of the sparse part penalized differently, based on the 
given input. To our knowledge, ours is the first paper to study any weighted version; in that sense, while 
our weights have a natural motivation in our setting, our results are likely to have broader implications, 
for example robust versions of PCA when not all errors are created equal, but have a corresponding prior. 



2 Algorithm 

Idea: Our algorithm is a convex relaxation of a natural combinatorial objective for the sparse clustering 
problem. We now briefly motivate this objective, and then formally describe our algorithm. Recall that 
we want to find a clustering (i.e. a partition of the nodes) such that in-cluster connectivity is denser than 
across-cluster connectivity. Said differently, we want a clustering that has a small number of errors, where 
an error is either (a) an edge between two nodes in different clusters, or (b) a missing edge between two 
nodes in the same cluster. A natural (combinatorial) objective is to minimize a weighted combination of 
the two types of errors. 

The correlation clustering setup gives equal weights to the two types of errors. However, for sparse 
graphs, this will yield clusters with a very small number of nodes. This is because there is sparsity both 
within clusters and across clusters; grouping nodes in the same cluster will result in a lot of errors of type 
(b) above, without yielding corresponding gains in errors of type (a) - even when they may actually be in 
the same cluster. This can be very easily seen: suppose, for example, the "true" clustering has two clusters 
with equal size, and the in-cluster and across-cluster edge density are both less than 1/4. Then, when both 
errors are weighted equally, the clustering which puts every node in a separate cluster will have lower cost 
than the true clustering. 
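
The following sketch illustrates this failure of equal weighting on a small random instance. It is not part of the paper: the sizes, the densities p = 0.2 and q = 0.05, and the helper name `disagreements` are illustrative assumptions.

```python
import numpy as np

def disagreements(A, labels):
    """Count unordered node pairs on which the graph and the clustering disagree."""
    same = labels[:, None] == labels[None, :]
    iu = np.triu_indices(len(labels), k=1)
    missing_within = np.sum(same[iu] & (A[iu] == 0))   # errors of type (b)
    present_across = np.sum(~same[iu] & (A[iu] == 1))  # errors of type (a)
    return int(missing_within + present_across)

rng = np.random.default_rng(0)
n, p, q = 200, 0.2, 0.05                 # two equal clusters, sparse densities
truth = np.repeat([0, 1], n // 2)
prob = np.where(truth[:, None] == truth[None, :], p, q)
upper = np.triu(rng.random((n, n)) < prob, k=1)
A = (upper | upper.T).astype(int)

cost_truth = disagreements(A, truth)              # dominated by missing in-cluster edges
cost_singletons = disagreements(A, np.arange(n))  # every present edge is an error
print(cost_truth, cost_singletons)
```

With equal weights the all-singletons clustering pays only for the few edges that exist, while the true clustering pays for every missing in-cluster edge, so the former comes out cheaper.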

To get more meaningful solutions, we penalize the two types of errors differently. In particular, sparsity 
means that we can expect many more errors of type (b) in any solution, and hence we should give this 
(potentially much) smaller weight than errors of type (a). Our crucial insight is that we can know what 
kind of error will (potentially) occur on any given pair of nodes from the given adjacency matrix $A$ itself. 
In particular, if $a_{ij} = 1$ for some pair $i, j$, then in any clustering it will either have no error, or an error of 
type (a); it will never be an error of type (b). Similarly if $a_{ij} = 0$ then it can only be an error of type (b), 
if at all. Our algorithm is a convex relaxation of the combinatorial problem of finding the minimum cost 
clustering, with the cost for an error on pair $i, j$ determined based on the value of $a_{ij}$. Perhaps surprisingly, 
this simple idea yields better results than the extensive literature already in place for planted partitions. 

We proceed by representing the given adjacency matrix A as the sum of two matrices A = Y + S, 
where we would like $Y$ to be a cluster matrix, with $y_{ij} = 1$ if and only if $i, j$ are in the same cluster, and 
0 otherwise.¹ ² $S$ is the corresponding error matrix as compared to the given $A$, and has values of $+1$, $-1$ 
and 0. 

¹ In this paper we will assume the convention that $a_{ii} = 1$ and $y_{ii} = 1$ for all nodes $i$. 
² In other words, $Y$ is the adjacency matrix of a graph consisting of disjoint cliques. 



We now make a cost matrix $C \in \mathbb{R}^{n \times n}$ based on the insight above; we choose two values $c_A$ and $c_{A^c}$ 
and set $c_{ij} = c_A$ if the corresponding $a_{ij} = 1$, and $c_{ij} = c_{A^c}$ if $a_{ij} = 0$. With this setup, we have 

Combinatorial Objective: $\min_{Y, S} \; \|C \circ S\|_1$    (1) 
    s.t. $Y + S = A$, 
         $Y$ is a cluster matrix. 

Here $C \circ S$ denotes the matrix obtained via element-wise product between the two matrices $C, S$, i.e. 
$(C \circ S)_{ij} = c_{ij} s_{ij}$. Also $\|\cdot\|_1$ denotes the element-wise $\ell_1$ norm (i.e. sum of absolute values of elements). 
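
As a small worked instance of the weighted objective, the sketch below builds the cost matrix from the adjacency matrix and evaluates the cost of a candidate cluster matrix. The function name `weighted_cost`, the 4-node graph, and the weights 1.0 and 0.25 are illustrative assumptions, not values from the paper.

```python
import numpy as np

def weighted_cost(A, Y, c_A, c_Ac):
    """Evaluate ||C o S||_1 with S = A - Y and C built from A as described above."""
    A = np.asarray(A, dtype=float)
    S = A - np.asarray(Y, dtype=float)
    C = np.where(A == 1, c_A, c_Ac)   # c_A on present edges, c_{A^c} on absent ones
    return float(np.sum(np.abs(C * S)))

# Candidate clusters {0, 1} and {2, 3}; the graph has a spurious edge (1, 2)
# and is missing the in-cluster edge (2, 3). Diagonals follow a_ii = 1.
A = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 1, 1, 0],
              [0, 0, 0, 1]])
Y = np.kron(np.eye(2), np.ones((2, 2)))
cost = weighted_cost(A, Y, c_A=1.0, c_Ac=0.25)
print(cost)
```

Each disagreement is counted twice because the matrices are symmetric: the spurious edge contributes $2 c_A$ and the missing edge contributes $2 c_{A^c}$.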

Algorithm: Our algorithm involves solving a convex relaxation of this combinatorial objective, by 
replacing the "$Y$ is a cluster matrix" constraint with (i) constraints $0 \le y_{ij} \le 1$ for all elements $i, j$, and 
(ii) a nuclear norm³ penalty $\|Y\|_*$ in the objective. The latter encourages $Y$ to be low-rank, and is based 
on the well-established insight that the cluster matrix (being a block-diagonal collection of 1's) is low-rank. 
Thus we have our algorithm: 

Sparse Graph Clustering: $\min_{Y, S} \; \|Y\|_* + \|C \circ S\|_1$    (2) 
    s.t. $0 \le y_{ij} \le 1, \; \forall i, j$,    (3) 
         $Y + S = A$. 

Once the optimal solution $Y$ is obtained, check if it is a cluster matrix (e.g. via an SVD, which will 
also reveal cluster membership if it is). If it is not, any one of several rounding/aggregation ideas can 
be used empirically. Our theoretical results provide sufficient conditions under which the optimum of the 
convex program is integral and a clustering, with no rounding required. Section 4 provides details on fast 
implementation for large matrices; this is one reason we did not include a semidefinite constraint on $Y$ in 
our algorithm. Our algorithm has two positive parameters: $c_A$, $c_{A^c}$. We defer discussion on how to choose 
them until after our main result. 
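
One possible way to carry out the "is it a cluster matrix" check is sketched below. Note the assumptions: the paper suggests an SVD, whereas this sketch swaps in a direct combinatorial check, and the function name `clusters_from_matrix`, the tolerance, and the use of label -1 for nodes with an all-zero row are all hypothetical choices.

```python
import numpy as np

def clusters_from_matrix(Y, tol=1e-6):
    """Return a label vector if Y is (numerically) a cluster matrix, else None.

    Nodes whose row is entirely zero receive the label -1.
    """
    Y = np.asarray(Y, dtype=float)
    B = np.rint(Y)
    if np.max(np.abs(Y - B)) > tol or not np.isin(B, (0, 1)).all():
        return None                      # not (close to) a 0/1 matrix
    if not np.array_equal(B, B.T):
        return None                      # not symmetric
    n = B.shape[0]
    labels = np.full(n, -1)
    next_label = 0
    for i in range(n):
        if labels[i] != -1 or B[i, i] == 0:
            continue
        labels[np.flatnonzero(B[i])] = next_label   # row support = candidate cluster
        next_label += 1
    # Rebuild the disjoint-clique matrix implied by the labels and compare.
    same = (labels[:, None] == labels[None, :]) & (labels[:, None] >= 0)
    return labels if np.array_equal(B, same.astype(float)) else None

Y = np.zeros((6, 6))
Y[:2, :2] = 1
Y[2:5, 2:5] = 1
print(clusters_from_matrix(Y))
```

If the function returns None, the solution is not a union of disjoint cliques and some rounding step would be needed.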

Comments: Based on the given $A$ and these values, the optimal $Y$ may or may not be a cluster 
matrix. If $Y$ is a cluster matrix, then clearly it minimizes the combinatorial objective above. Additionally, 
it is not hard to see (proof in Section 5) that its performance is "monotone", in the sense that adding 
edges "aligned with" $Y$ cannot result in a different optimum, as summarized in the following lemma. This 
implies that, in the terminology of [231 HI US], our method is robust under a classical semi-random model 
where an adversary can add edges within clusters and remove edges between clusters. 

Lemma 1. Suppose $Y$ is the optimum of Formulation (2) for a given $A$. Suppose now we arbitrarily change 
some edges of $A$ to obtain $\tilde{A}$, by (a) choosing some edges such that $y_{ij} = 1$ but $a_{ij} = 0$, and making $\tilde{a}_{ij} = 1$, 
and (b) choosing some edges where $y_{ij} = 0$ but $a_{ij} = 1$, and making $\tilde{a}_{ij} = 0$. Then, $Y$ is also an optimum 
of Formulation (2) with $\tilde{A}$ as the input. 

Our theoretical guarantees characterize when the optimal Y will be a cluster matrix, and recover the 
clustering, in a natural classical problem setting called the planted partition model. These theoretical 
guarantees also provide guidance on how one would pick parameter values in practice; we thus defer 
discussion on parameter picking until after we present our main theorem. 

³ The nuclear norm of a matrix is the sum of its singular values. 

3 Performance Guarantees 



In this section we provide analytical performance guarantees for our algorithm under a natural and classical 
graph clustering setting: (a generalization of) the planted partition model [11]. We first describe the model, 
and then our results. 

(Generalized) Planted partition model: Consider a random graph generated as follows: The 
$n = n_1 + n_2$ nodes are divided into two disjoint sets $V_1$ and $V_2$. The $n_1$ nodes in $V_1$ are partitioned into 
$r$ disjoint clusters, which we will refer to as the "true" clusters. Let $K$ be the minimum cluster size. For 
every pair of nodes $i, j$ that belong to the same cluster, edge $(i, j)$ is present in the graph with probability 
that is at least $p$, while for every pair where the nodes are in different clusters the edge is present with 
probability at most $q$. The other $n_2$ nodes in $V_2$ are not in any clusters; for each $i \in V_2$ and $j \in V_1 \cup V_2$, 
there is an edge between the pair $i, j$ with probability at most $q$. The objective is to find the partition, 
given the random graph generated from it. 

We call this model the "generalized" planted partition model because it allows heterogeneity in the 
graph. Clusters can have different sizes, and the edge probabilities can be different; we only require them 
to be uniformly bounded as mentioned. There might also be nodes (i.e., V2) that are isolated and not in 
any cluster. 
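
A graph from this model can be sampled as in the sketch below. It covers only the boundary case in which every in-cluster probability equals p and every other probability equals q; the function name `sample_planted_partition`, its arguments, and the use of label -1 for the nodes of $V_2$ are illustrative assumptions.

```python
import numpy as np

def sample_planted_partition(sizes, n_isolated, p, q, seed=0):
    """Sample (A, labels) with in-cluster edge probability p and probability q elsewhere.

    labels[i] is the cluster index of node i, or -1 for the n_isolated nodes of V_2.
    """
    rng = np.random.default_rng(seed)
    labels = np.concatenate([np.repeat(np.arange(len(sizes)), sizes),
                             np.full(n_isolated, -1)])
    n = labels.size
    same = (labels[:, None] == labels[None, :]) & (labels[:, None] >= 0)
    prob = np.where(same, p, q)
    upper = np.triu(rng.random((n, n)) < prob, k=1)   # sample each pair once
    A = (upper | upper.T).astype(int)                 # symmetrise
    np.fill_diagonal(A, 1)                            # convention a_ii = 1
    return A, labels

A, labels = sample_planted_partition([30, 20], n_isolated=10, p=0.5, q=0.05, seed=1)
print(A.shape, np.unique(labels))
```

Clusters of different sizes are passed through `sizes`; heterogeneous probabilities would require replacing the scalars p and q by matrices.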

Recall that $A$ is the given adjacency matrix of the graph, and let $Y^*$ be the matrix corresponding to 
the true clusters as above - i.e. $y^*_{ij} = 1$ if and only if $i, j \in V_1$ and they are in the same true cluster, and 
0 otherwise. Our result below establishes conditions under which our algorithm, specifically the convex 
program (2)-(3), yields this $Y^*$ as the unique optimum (without any further need for rounding etc.) with 

high probability. Throughout the paper, with high probability (w.h.p.) means with probability at least 



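Theorem 1. Suppose we choose 

$$c_A = \frac{1}{16\sqrt{n \log n}} \min\left\{ \sqrt{\frac{1-q}{q}}, \sqrt{\frac{n}{\log^4 n}} \right\}, \qquad c_{A^c} = \frac{1}{16\sqrt{n \log n}} \min\left\{ \sqrt{\frac{p}{1-p}}, 1 \right\}.$$

Then $(Y^*, A - Y^*)$ is the unique optimal solution to Formulation (2) w.h.p. provided $q \le \frac{1}{4}$, and 

$$\frac{p - q}{\sqrt{p}} \ge c_1 \frac{\sqrt{n}}{K} \log^2 n,$$

where $c_1$ is an absolute positive constant. 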



Our theorem quantifies the tradeoff between the two quantities governing the hardness of a planted 
partition problem - the difference in edge densities p — q, and the minimum cluster size K - required for 
our algorithm to succeed, i.e. to recover the planted partition without any error. Note that here p, q and K 
are allowed to scale with n. We now discuss and remark on our result, and then compare its performance 
to past approaches and theoretical results in Table 1. 

Note that we need $K$ to be $\Omega(\sqrt{n} \log^2 n)$. This will be achieved only when $p - q$ is a constant that 
does not change with $n$; indeed in this extreme our theorem becomes a "dense graph" result, matching e.g. 
the scaling in [19, 23]. If $(p - q)/\sqrt{p}$ decreases with $n$, corresponding to a sparser regime, then the minimum size 
of $K$ required will increase. 

A nice feature of our work is that we need $p - q$ to be large only as compared to $\sqrt{p}$; several other 
existing results (see Table 1) require a lower bound (as a function only of $n$, or $n, K$) on $p - q$ itself. This 
Table 1: Comparison with literature. This table shows the lower-bound requirements on $K$ and $p - q$ 
that existing literature needs for exact recovery of the planted partitions/clusters. Here $\tilde{\Omega}(\cdot)$ is the soft-$\Omega$ 
notation. Note that this table is under the assumption that every cluster is of size $K$, and the edge densities 
are uniformly $p$ and $q$ (for within and across clusters respectively). As can be seen, our algorithm achieves 
a better $p - q$ scaling than every other result. And, we achieve a better $K$ scaling than every other result 
except Shamir [27], Oymak & Hassibi [25], Giesen & Mitsche [16] and Chaudhuri et al. [T^. Perhaps more 
importantly, we use a completely different algorithmic approach from all of the others. 

allows us to guarantee recovery for much sparser graphs than all existing results. For example, when $K$ 
is $\Theta(n)$, $p$ and $p - q$ can be as small as $\Theta\left(\frac{\log^4 n}{n}\right)$. This scaling is close to optimal: if p < then each 
cluster will be almost surely disconnected, and if $p - q = o\left(\frac{1}{n}\right)$, then on average a node has equally many 
neighbours in its own cluster and in another cluster - both are ill-posed situations in which one cannot 
hope to recover the underlying clustering. When K = Q (y^log^ n), p and p — q can be O ^ "'g^ ^ while 

the previous best result for this regime requires at least O ^-^^ [23]. 

Another improvement of our approach over existing ones is that we can handle a large number of 
isolated nodes. For example, as long as we have $K = \Omega(\sqrt{n_1})$, then the number of isolated nodes can be 
on the same order as the normal nodes ($n_2 = \Theta(n_1) = \Theta(n)$). 
Choosing the Parameters 

Our algorithm does not need to know the number or sizes of the clusters. Instead, it has two parameters: 
$c_A$ and $c_{A^c}$. The theorem provides a way to choose their values, if we know the values of the bounds $p$, $q$. 
Under the generalized planted partition model, setting p and q is equivalent to deciding how densely a 
group of nodes are connected in order to qualify as a cluster - e.g., consider the case where the graph has 
a hierarchical clustering structure and the edge probabilities are different for clusters at different levels of 
the hierarchy. Therefore, choosing the parameters is inherently subjective in general. 

On the other hand, when the graph does not have such a hierarchical structure, we can estimate $p$ 
and $q$ reliably from the observed data. Consider the standard planted partition model, where all clusters 
have the same size $K$, the edge probabilities are uniform (i.e., equal to $p$ within clusters and $q$ between 
clusters), and there are no isolated nodes ($n_2 = 0$). In this case, it is easy to show that the first eigenvalue 
of $\mathbb{E}[A] - I$ is $K(p - q) - p + nq$ with multiplicity 1, the second eigenvalue is $K(p - q) - p$ with multiplicity 
$r - 1$, and the third eigenvalue is $-p$ with multiplicity $n - r$ [11]. This motivates us to use the 
eigenvalues of $A - I$ to estimate $p$ and $q$, as described in Algorithm 1. 

Algorithm 1 Estimate p, q, and t 

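1. Compute and sort the eigenvalues of $A - I$, denoted as $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_n$. 

2. Let $\hat{r} = \arg\max_{i = 2, \dots, n-1} (\lambda_i - \lambda_{i+1})$. Set $\hat{K} = n / \hat{r}$. 

3. Set $\hat{p} = \frac{\hat{K} \lambda_1 + (n - \hat{K}) \lambda_2}{n (\hat{K} - 1)}$ and $\hat{q} = \frac{\lambda_1 - \lambda_2}{n}$. 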
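
The eigenvalue-based estimation can be sketched in a few lines of NumPy. This is an illustration under assumptions: the function name `estimate_parameters` is hypothetical, and the closed forms for the estimates of p and q are obtained here by inverting the two eigenvalue formulas $\lambda_1 = K(p - q) - p + nq$ and $\lambda_2 = K(p - q) - p$ quoted above.

```python
import numpy as np

def estimate_parameters(A):
    """Estimate (r, K, p, q) from the eigenvalues of A - I."""
    n = A.shape[0]
    lam = np.sort(np.linalg.eigvalsh(A - np.eye(n)))[::-1]   # descending order
    # Gaps lam_i - lam_{i+1} for i = 2, ..., n-1 in 1-based indexing.
    gaps = lam[1:n - 1] - lam[2:n]
    r_hat = int(np.argmax(gaps)) + 2
    K_hat = n / r_hat
    q_hat = (lam[0] - lam[1]) / n
    p_hat = (K_hat * lam[0] + (n - K_hat) * lam[1]) / (n * (K_hat - 1))
    return r_hat, K_hat, p_hat, q_hat

# Sanity check on the noiseless expectation E[A] (with unit diagonal),
# where the eigenvalue formulas hold exactly.
n, K, p, q = 20, 5, 0.6, 0.1
labels = np.repeat(np.arange(n // K), K)
EA = np.where(labels[:, None] == labels[None, :], p, q)
np.fill_diagonal(EA, 1.0)
r_hat, K_hat, p_hat, q_hat = estimate_parameters(EA)
print(r_hat, K_hat, p_hat, q_hat)
```

On a sampled graph the eigenvalues are perturbed by the noise, so the estimates are only approximate.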



Our empirical results in Section 4 are based on Algorithm 1. The following proposition, stated without 
proof, guarantees that the estimation error is sufficiently small. We plan to include a full proof of this in 
a future preprint. 

Proposition 1. Under the standard planted partition model and the assumption of Theorem 1, with high 
probability, the output of Algorithm 1 satisfies 

$$\hat{K} = K, \qquad |\hat{p} - p| \le c_1 \frac{\sqrt{(pK + qn) \log n}}{K}, \qquad |\hat{q} - q| \le c_2 \frac{\sqrt{(pK + qn) \log n}}{n},$$

where $c_1$ and $c_2$ are absolute constants. 



Extensions 



We focus on sparse graphs in this paper. However, there is nothing that forces us to restrict to such graphs. 
Indeed, our approach also applies to dense graphs with both p and q very close to one. One important 
example in this regime is the so-called planted clique problem [1], where one starts with a graph consisting 
of several disjoint cliques with size at least K as well as some isolated nodes, and then adds diversionary 
edges between the cliques randomly with probability q. A slightly modified version of our algorithm is 
guaranteed to recover the underlying cliques with high probability provided 

n \og^ n 

More generally, we can handle problems with p < q as well. For example, in the planted coloring 
problem [21], one assumes that the nodes are divided into several color classes, with each class having at 
least K nodes, and edges are added between nodes with different colors with probability q. The goal is to 
recover the original coloring. Our algorithm succeeds on this problem with high probability as long as 

n log^ n 

We plan to have a formal treatment of these extensions in a future preprint. 



4 Empirical Results 
4.1 Implementation Issues 

The convex program (2) in the main paper can be solved using a general purpose SDP solver, but this 
method does not scale well to problems with more than 100 nodes. To facilitate fast and efficient solution, 
we propose to use a family of algorithms called Augmented Lagrange Multiplier (ALM) methods (see e.g. 
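[22]). We adapt the ALM method to our problem, given as Algorithm 2. Here $\mathcal{S}_{\epsilon C}(\cdot) : \mathbb{R}^{n \times n} \to \mathbb{R}^{n \times n}$ is 
the element-wise weighted soft-thresholding operator, defined as 

$$(\mathcal{S}_{\epsilon C}(X))_{ij} = \begin{cases} X_{ij} - \epsilon C_{ij}, & \text{if } X_{ij} > \epsilon C_{ij}, \\ X_{ij} + \epsilon C_{ij}, & \text{if } X_{ij} < -\epsilon C_{ij}, \\ 0, & \text{otherwise.} \end{cases}$$

In other words, it shrinks each entry $X_{ij}$ towards zero by $\epsilon C_{ij}$. The unweighted version $\mathcal{S}_{\epsilon}(\cdot)$, which applies 
the same threshold $\epsilon$ to every entry, is also used. The stopping criteria and parameters of the algorithm are chosen similarly to [22]. 

Algorithm 2 ALM for Minimizing Nuclear Norm plus Weighted $\ell_1$ Norm 
Input: $A, C \in \mathbb{R}^{n \times n}$. 
Initialize: $M^{(0)} = 0$; $Y^{(0)} = 0$; $S^{(0)} = 0$; $\mu_0 > 0$; $\alpha > 1$; $k = 0$. 
while not converged do 
    $(U, \Sigma, V) = \mathrm{svd}(A - S^{(k)} + \mu_k^{-1} M^{(k)})$. 
    $Y^{(k+1)} = U \mathcal{S}_{\mu_k^{-1}}(\Sigma) V^\top$. 
    For all $(i, j)$, $Y^{(k+1)}_{ij} = \max\{\min\{Y^{(k+1)}_{ij}, 1\}, 0\}$. 
    $S^{(k+1)} = \mathcal{S}_{\mu_k^{-1} C}(A - Y^{(k+1)} + \mu_k^{-1} M^{(k)})$. 
    $M^{(k+1)} = M^{(k)} + \mu_k (A - Y^{(k+1)} - S^{(k+1)})$. 
    $\mu_{k+1} = \alpha \mu_k$. 
    $k = k + 1$. 
end while 
Return $(Y^{(k)}, S^{(k)})$. 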
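
A minimal NumPy sketch of an ALM-style iteration of this kind is given below. It is an illustration under assumptions rather than the authors' implementation: the function names, the defaults `mu=1.0` and `alpha=1.1`, the stopping rule on the constraint residual, and the placement of the S-update after the box projection are all choices made here.

```python
import numpy as np

def soft_threshold(X, T):
    """Shrink each entry of X toward zero by the matching threshold T (scalar or array)."""
    return np.sign(X) * np.maximum(np.abs(X) - T, 0.0)

def alm_sparse_clustering(A, C, mu=1.0, alpha=1.1, tol=1e-7, max_iter=1000):
    """Heuristically minimise ||Y||_* + ||C o S||_1 s.t. Y + S = A, 0 <= Y <= 1."""
    A = np.asarray(A, dtype=float)
    Y = np.zeros_like(A)
    S = np.zeros_like(A)
    M = np.zeros_like(A)                      # Lagrange multiplier
    for _ in range(max_iter):
        # Y-step: singular value thresholding, then projection onto the box [0, 1].
        U, sig, Vt = np.linalg.svd(A - S + M / mu, full_matrices=False)
        Y = (U * soft_threshold(sig, 1.0 / mu)) @ Vt
        Y = np.clip(Y, 0.0, 1.0)
        # S-step: weighted soft-thresholding with entry-wise thresholds C / mu.
        S = soft_threshold(A - Y + M / mu, C / mu)
        R = A - Y - S                         # constraint residual
        M = M + mu * R
        mu *= alpha
        if np.linalg.norm(R) <= tol * max(np.linalg.norm(A), 1.0):
            break
    return Y, S

# Two 4-cliques joined by one spurious edge; illustrative weights only.
A = np.kron(np.eye(2), np.ones((4, 4)))
A[0, 5] = A[5, 0] = 1.0
C = np.where(A == 1, 1.0, 0.3)
Y, S = alm_sparse_clustering(A, C)
print(np.round(Y, 2))
```

Because the penalty parameter grows geometrically, the residual of the constraint $Y + S = A$ shrinks at each iteration; whether the returned $Y$ is a cluster matrix depends on the graph and on the weights.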

4.2 Simulations 

We perform experiments on synthetic data, and compare with other methods. We generate a graph using 
the planted partition model with $n = 1000$ nodes, $r = 5$ clusters with equal size $K = 200$, and $p, q \in [0, 1]$. 
We apply our method to the data, where we use the fast solver described in the supplementary material. 
We estimate $p$ and $q$ using the heuristic described in Section 3, and choose the weights $c_A$ and $c_{A^c}$ according 
to the main theorem.⁴ Due to numerical accuracy, the output $Y$ of our algorithm may not be integer, so 
we do the following simple rounding: compute the mean $\bar{y}$ of the entries of $Y$, and round each entry of $Y$ 
to 1 if it is greater than $\bar{y}$, and 0 otherwise. We measure the error by $\|Y^* - \mathrm{round}(Y)\|_1$, which is simply 
the number of misclassified pairs. We say our method succeeds if it misclassifies fewer than 0.1% of the pairs. 
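
The rounding rule and the error measure can be sketched as follows. The function names are hypothetical, and counting the 0.1% threshold against all $n^2$ matrix entries (ordered pairs) is an assumption about how the fraction is normalised.

```python
import numpy as np

def round_by_mean(Y):
    """Round entries above the mean entry to 1 and all others to 0."""
    return (Y > Y.mean()).astype(int)

def pair_error(Y_true, Y_est):
    """Element-wise l1 distance between the truth and the rounded estimate."""
    return int(np.abs(Y_true - round_by_mean(Y_est)).sum())

def succeeds(Y_true, Y_est, frac=0.001):
    """Success if fewer than a fraction `frac` of the entries are misclassified."""
    return pair_error(Y_true, Y_est) < frac * Y_true.size

Y_true = np.kron(np.eye(2), np.ones((3, 3)))
Y_est = 0.9 * Y_true + 0.05          # a slightly non-integral solver output
print(pair_error(Y_true, Y_est), succeeds(Y_true, Y_est))
```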

For comparison, we consider three alternative methods: (1) Single-Linkage clustering (SLINK) [28], 
which is a hierarchical clustering method that merges the most similar clusters in each iteration. We use 
the difference of neighbours, namely $\|A_{i\cdot} - A_{j\cdot}\|_1$, as the distance measure of node $i$ and $j$, and output 
when SLINK finds a clustering with $r = 5$ clusters. (2) A spectral clustering method [32], where we run 
SLINK on the top $r = 5$ singular vectors of $A$. (3) Low-rank-plus-sparse approach [ISIES], followed by the 
same rounding scheme. Note the first two methods assume knowledge of $r$, which is not available to our 
method. Success is measured in the same way as above. 

For each $q$, we find the smallest $p$ for which a method succeeds, and average over 20 trials. The results 
are shown in Figure 1(a), where the area above each curve corresponds to the range of feasible $(p, q)$ for 
each method. It can be seen that our method subsumes all others, in that we succeed for a strictly larger 
range of $(p, q)$. Figure 1(b) shows more detailed results for sparse graphs ($p < 0.3$, $q < 0.1$), for which 
SLINK and trace-norm-plus-unweighted-$\ell_1$ completely fail, while our method significantly outperforms the 
spectral method, the only alternative method that works in this regime. 

5 Proof of Lemma 1 

In this section we prove the monotonicity Lemma 1. 

Proof. Let $\Phi$ denote the set of entries that have been changed. Notice that since $A$ and $\tilde{A}$ are different, the 
respective weight matrices $C$ and $\tilde{C}$ are also different. In particular, they differ on $\Phi$. Let $Y'$ be an 
arbitrary feasible solution, then we have by the optimality of $Y$ 

$$\|Y\|_* + \|C \circ (A - Y)\|_1 \le \|Y'\|_* + \|C \circ (A - Y')\|_1.$$

Next notice that from the definition of $\tilde{A}$ we have 

$$\|Y\|_* + \|\tilde{C} \circ (\tilde{A} - Y)\|_1 = \|Y\|_* + \|C \circ (A - Y)\|_1 - \sum_{(i,j) \in \Phi} C_{ij},$$

while on the other hand 

$$\|Y'\|_* + \|C \circ (A - Y')\|_1 - \left[ \|Y'\|_* + \|\tilde{C} \circ (\tilde{A} - Y')\|_1 \right] = \sum_{(i,j) \in \Phi} \left[ C_{ij} |A - Y'|_{ij} - \tilde{C}_{ij} |\tilde{A} - Y'|_{ij} \right] \le \sum_{(i,j) \in \Phi} C_{ij},$$

where in the last inequality we use $\|A - Y'\|_\infty \le 1$ and $\|\tilde{A} - Y'\|_\infty \le 1$. Combining all equations together 
establishes that 

$$\|Y\|_* + \|\tilde{C} \circ (\tilde{A} - Y)\|_1 \le \|Y'\|_* + \|\tilde{C} \circ (\tilde{A} - Y')\|_1.$$

As $Y'$ is arbitrary, the lemma follows. □ 

⁴ We point out that searching for the best $c_A$ and $c_{A^c}$ while keeping $c_A / c_{A^c}$ fixed might lead to better performance, which 
we do not pursue here. 

Figure 1: (a) Comparison of our method with Single-Linkage clustering (SLINK), spectral clustering, and 
low-rank-plus-sparse (L+S) approach. The area above each curve is the values of $(p, q)$ for which a method 
successfully recovers the underlying true clustering. (b) More detailed results for the area in the box in 
(a). The experiments are conducted on synthetic data with $n = 1000$ nodes and $r = 5$ clusters with equal 
size $K = 200$. 



6 Proof of Theorem 1 

We prove our main Theorem 1 in this section. 



6.1 Notation, Preliminaries, and the Main Idea 

In this subsection, we introduce the notation we use in the proof, and briefly explain the main idea of the 
proof and highlight the novelty of the analysis. 
Let $S^* = A - Y^*$ be the true disagreement matrix. Recall that $r$ is the number of clusters, $k_i$ is the size of the $i$th cluster, 
and $K = \min_i k_i$. Let $\Omega = \mathrm{support}(S^*)$. As is standard, we denote the singular value decomposition of $Y^*$ 
(notice $Y^*$ is symmetric and positive semidefinite) by $U_0 \Sigma_0 U_0^\top$, and let $P_{T^\perp}(M) = (I - U_0 U_0^\top) M (I - U_0 U_0^\top)$ 
be the projection of $M$ onto the space of matrices whose columns and rows are orthogonal to those of $Y^*$, 
and $P_T(M) = M - P_{T^\perp}(M)$. 

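To exploit the special structure of the clustering setup, we introduce some new notation: $R_i = 
\{(l, m) : l, m \in \text{cluster } i\}$, $R = \cup_{i=1}^{r} R_i = \mathrm{support}(Y^*)$. For an entry set $\Phi \subseteq [1 : n] \times [1 : n]$, we use 
$\mathbf{1}_\Phi \in \mathbb{R}^{n \times n}$ to denote the matrix which is one on entries belonging to $\Phi$ and zero elsewhere. Thus, we 
have $Y^* = \sum_{i=1}^{r} \mathbf{1}_{R_i}$ and $U_0 U_0^\top = \sum_{i=1}^{r} \frac{1}{k_i} \mathbf{1}_{R_i}$. Also, recall that $\forall (i, j) \in A^c$ we have $C_{ij} = c_{A^c}$, while 
$\forall (i, j) \in A$ we have $C_{ij} = c_A$ (identifying $A$ with the set of entries where $a_{ij} = 1$). Hence, we may write 

$$C = c_{A^c} \mathbf{1}_{A^c} + c_A \mathbf{1}_{A} = c_{A^c} \mathbf{1}_{(R \cap \Omega) \cup (R^c \cap \Omega^c)} + c_A \mathbf{1}_{(R^c \cap \Omega) \cup (R \cap \Omega^c)}.$$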
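
The identities for $Y^*$ and $U_0 U_0^\top$, and the two projections, can be checked numerically on a toy instance. The sketch below is illustrative only: the cluster sizes are arbitrary, there are no isolated nodes, and the helper names `P_T` and `P_T_perp` are assumptions chosen to mirror the notation.

```python
import numpy as np

sizes = [3, 2, 4]                                   # toy cluster sizes k_i
labels = np.repeat(np.arange(len(sizes)), sizes)
same = labels[:, None] == labels[None, :]
Y_star = same.astype(float)                         # sum of the indicator matrices 1_{R_i}

# Columns of U0 are the normalised indicator vectors of the clusters.
U0 = np.stack([(labels == c) / np.sqrt(k) for c, k in enumerate(sizes)], axis=1)
proj = U0 @ U0.T
expected = same / np.array(sizes)[labels][:, None]  # entry 1/k_i on block R_i

def P_T_perp(M):
    """Project M onto matrices whose rows and columns are orthogonal to those of Y*."""
    Q = np.eye(M.shape[0]) - proj
    return Q @ M @ Q

def P_T(M):
    """Complementary projection."""
    return M - P_T_perp(M)

rng = np.random.default_rng(0)
M = rng.standard_normal((len(labels), len(labels)))
print(np.allclose(proj, expected), np.allclose(P_T(P_T(M)), P_T(M)))
```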

Notice that in the graph clustering setup where $0 \le A_{ij} \le 1$ for all $i, j$, the sparse corruption matrix 
$S^*$ cannot be arbitrary. In particular, we have that $(S^*)_{ij} = 0$ or $-1$ for $(i, j) \in R$, and $(S^*)_{ij} = 0$ or $1$ for 
$(i, j) \in R^c$, which implies 

$$S^* = \mathrm{sign}(S^*) = -\mathbf{1}_{\Omega \cap R} + \mathbf{1}_{\Omega \cap R^c}.$$

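Moreover, the non-zero diagonal entries of $S^*$, namely the set $\{i : S^*_{ii} = 1\} = \{i : Y^*_{ii} = 0\} \subseteq \Omega \cap R^c$, 
correspond to the $n_2$ isolated nodes. Let $(Y^* + \Delta, S^* - \Delta)$ be a feasible solution to the convex program (2). 
Because of the constraints (3), $\Delta$ must belong to the set of possible deviations $\mathfrak{D}$, defined as 

$$\mathfrak{D} = \{\Delta \in \mathbb{R}^{n \times n} \mid \forall (i, j) \in R : -1 \le \Delta_{ij} \le 0; \; \forall (i, j) \in R^c : 1 \ge \Delta_{ij} \ge 0\}.$$

Observe that for any $(i, j)$, $S^*_{ij}$ and $\Delta_{ij}$ either have the same sign, or at least one of them is zero. Thus for any 
$\Delta \in \mathfrak{D}$, 

$$\langle C \circ S^*, \Delta \rangle = \|P_\Omega(C \circ \Delta)\|_1. \qquad (4)$$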
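
Identity (4) can be sanity-checked numerically on a random instance, as in the sketch below. Everything concrete here is an assumption made for illustration: the cluster sizes, the edge probabilities 0.7 and 0.2, and the weights $c_A = 2$ and $c_{A^c} = 0.5$.

```python
import numpy as np

rng = np.random.default_rng(0)
labels = np.repeat([0, 1], [4, 3])
R = labels[:, None] == labels[None, :]          # support of Y*
Y_star = R.astype(float)
n = len(labels)

# A symmetric 0/1 graph with a_ii = 1 that disagrees with Y* on some pairs.
upper = np.triu(rng.random((n, n)) < np.where(R, 0.7, 0.2), k=1)
A = (upper | upper.T).astype(float)
np.fill_diagonal(A, 1.0)

S_star = A - Y_star
Omega = S_star != 0                             # support of S*
C = np.where(A == 1, 2.0, 0.5)                  # illustrative weights

# A random feasible deviation: in [-1, 0] on R and in [0, 1] on R^c.
Delta = np.where(R, -rng.random((n, n)), rng.random((n, n)))

lhs = np.sum(C * S_star * Delta)                # <C o S*, Delta>
rhs = np.sum(np.abs(C * Delta)[Omega])          # ||P_Omega(C o Delta)||_1
print(lhs, rhs)
```

The two quantities agree because on $\Omega \cap R$ both $S^*_{ij}$ and $\Delta_{ij}$ are non-positive, and on $\Omega \cap R^c$ both are non-negative.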

The proof consists of two main steps. In step 1, we develop a new approximate dual certificate 
condition, i.e., a set of stipulations which, if satisfied by any matrix $W$, would guarantee the optimality of 
$(Y^*, S^*)$. Then, in step 2 we explicitly construct a $W$ and show that it satisfies these stipulations with high 
probability. 

We briefly explain the novelty in the analysis. While at a high level the above two steps have been 
employed in several papers on sparse and low-rank matrix decomposition, our analysis is different because 
it relies critically on the specific clustering setting we are in. In particular, we take advantage of the fact 
that any feasible deviation must belong to the set $\mathfrak{D}$ and hence Equation (4) holds; our approximate dual 
certificate condition and our choice of the dual certificate W are customized to this setting. As a result, 
even though we are looking at a potentially more involved setting with input-dependent weights on the 
sparse matrix regularizer, our proof is much simpler than several others in this space. Also, existing proofs 
with unweighted regularizers do not cover our setting. 

6.2 Step 1: Dual Certificate Condition 

The following proposition provides a sufficient condition for the optimality of $(Y^*, S^*)$. 

Proposition 2. If there exists a matrix $W \in \mathbb{R}^{n \times n}$ and a positive number $\epsilon$ obeying the following conditions 

1. $\|P_{T^\perp} W\| \le 1$. 

2. $\|P_T(W)\|_\infty \le \frac{\epsilon}{2} \min\{c_{A^c}, c_A\}$. 

3. $\langle P_\Omega(U_0 U_0^\top + W), \Delta \rangle = (1 + \epsilon) \|P_\Omega(C \circ \Delta)\|_1$, $\forall \Delta \in \mathfrak{D}$. 

4. $\langle P_{\Omega^c}(U_0 U_0^\top + W), \Delta \rangle \ge -(1 - \epsilon) \|P_{\Omega^c}(C \circ \Delta)\|_1$, $\forall \Delta \in \mathfrak{D}$, 

then $(Y^*, S^*)$ is the unique optimal solution to (2). 

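Proof. From the first condition we know $U_0 U_0^\top + P_{T^\perp}(W)$ is a subgradient of $\|Y\|_*$ at $Y^*$. Consider any 
feasible solution $(Y^* + \Delta, S^* - \Delta)$ with $\Delta \in \mathfrak{D}$ and $\Delta \ne 0$. For this $\Delta$, we can choose $F \in \Omega^c$ (i.e., supported on $\Omega^c$), $\|F\|_\infty \le 1$, 
such that $\langle C \circ F, -\Delta \rangle = \|P_{\Omega^c}(C \circ \Delta)\|_1$; in this case $C \circ (S^* + F) = C \circ (\mathrm{sign}(S^*) + F)$ is a subgradient 
of $\|C \circ S\|_1$ at $S^*$. We have the following chain of inequalities. 

$$\begin{aligned}
& \|Y^* + \Delta\|_* - \|Y^*\|_* + \|C \circ (S^* - \Delta)\|_1 - \|C \circ S^*\|_1 \\
&\overset{(a)}{\ge} \langle U_0 U_0^\top + P_{T^\perp}(W), \Delta \rangle + \langle C \circ S^* + C \circ F, -\Delta \rangle \\
&\overset{(b)}{=} \langle U_0 U_0^\top + W, \Delta \rangle - \langle P_T W, \Delta \rangle - \|P_\Omega(C \circ \Delta)\|_1 + \|P_{\Omega^c}(C \circ \Delta)\|_1 \\
&= \langle P_\Omega(U_0 U_0^\top + W), \Delta \rangle + \langle P_{\Omega^c}(U_0 U_0^\top + W), \Delta \rangle - \langle P_T W, \Delta \rangle - \|P_\Omega(C \circ \Delta)\|_1 + \|P_{\Omega^c}(C \circ \Delta)\|_1 \\
&\overset{(c)}{\ge} (1 + \epsilon) \|P_\Omega(C \circ \Delta)\|_1 - (1 - \epsilon) \|P_{\Omega^c}(C \circ \Delta)\|_1 - \langle P_T W, \Delta \rangle - \|P_\Omega(C \circ \Delta)\|_1 + \|P_{\Omega^c}(C \circ \Delta)\|_1 \\
&\overset{(d)}{\ge} \epsilon \|P_\Omega(C \circ \Delta)\|_1 + \epsilon \|P_{\Omega^c}(C \circ \Delta)\|_1 - \|P_T W\|_\infty \|\Delta\|_1 \\
&\overset{(e)}{\ge} \epsilon \|C \circ \Delta\|_1 - \frac{\epsilon}{2} \min\{c_{A^c}, c_A\} \|\Delta\|_1 \\
&> 0.
\end{aligned}$$

Here (a) uses the definition of subgradients, (b) follows from (4) and our choice of $F$, (c) uses conditions 3 
and 4, (d) uses the duality between $\|\cdot\|_\infty$ and $\|\cdot\|_1$, and (e) uses condition 2. This proves that $(Y^*, S^*)$ is 
the unique optimal solution to (2). □ 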

6.3 Step 2: Constructing W 

In this subsection we construct a $W$ and show that it satisfies the conditions in Proposition 2 w.h.p. Let 
$E = \{(i, i), i = 1, \dots, n\}$ be the set of diagonal entries, and $P_E$ and $P_{E^c}$ be defined similarly to $P_\Omega$. We 
define $W = W_1 + W_2 + W_3$ with $W_i$ given by 



"^=1 {j,i)e{'Rmn£;':)nn= ''^^-^ 



= (1 + e) 



u o i-'£;c ^^ ; H -'-(_R,n_E=)nQ'= ~ 7^ — ^{R'=nE'=)nn'= 

Pij i 'i'ijr 



where $p_{ij}$ is the probability of an edge being present between node $i$ and $j$ for $(i, j)$ in the same cluster, 
and $q_{ij}$ is the edge probability for all other $(i, j)$; by the definition of the model we have $p_{ij} \ge p$ and $q_{ij} \le q$ for all $(i, j)$. 

Intuitively speaking, the idea is that Wi and W2 are zero mean random matrices, so they are likely to have 
small norms. The matrix W3 = (1 + e)c_APE{S*) = (1 + e)c^Pn(-^) is a diagonal matrix, with (^3)^ = 1 
for those i corresponding to the isolated nodes, and it also has small norm. 

To prove Theorem 1 it remains to show that $W$ satisfies the desired conditions w.h.p.; this is done below.

Proposition 3. Under the assumptions of Theorem 1, $W$ with $\epsilon = \frac{2\log^2 n}{K}\sqrt{\frac{n}{p}}$ satisfies the conditions in Proposition 2 with high probability.

The rest of this section is devoted to the proof of this proposition. We need two technical lemmas. First observe that, due to the randomness of $\Omega$, $W_1$ and $W_2$ are symmetric random matrices with independent zero-mean entries. Moreover, the magnitude and variance of the entries are bounded as in the following lemma.

Lemma 2. Under the assumption of Theorem 1, where $c_1 \ge 16$, the following holds.

1. $\epsilon < \frac{1}{4}$.

2. The magnitude of the entries of $W_1$ and $W_2$ is bounded by $\frac{1}{16\log^2 n}$.

3. The variance of the entries of $W_1$ and $W_2$ is bounded by $\frac{1}{256\,n\log n}$.

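Proof. Note that $p - q \ge c_1\frac{\sqrt{pn}}{K}\log^2 n$ implies $p \ge c_1^2\frac{n\log^4 n}{K^2}$, which further implies $K \ge c_1\sqrt{n}\log^2 n$ since $p \le 1$. It follows that

$$
\epsilon = \frac{2\log^2 n}{K}\sqrt{\frac{n}{p}} \le \frac{2\log^2 n}{K}\cdot\frac{K}{c_1\log^2 n} = \frac{2}{c_1} < \frac{1}{4}.
$$

The entries of $W_1$ are either $-\frac{1}{k_m}$ or $\frac{1-p_{ij}}{p_{ij}k_m}$. Note that $\frac{1}{k_m} \le \frac{1}{K} \le \frac{1}{c_1\sqrt{n}\log^2 n}$, and $\frac{1-p_{ij}}{p_{ij}k_m} \le \frac{1}{pk_m} \le \frac{1}{pK} \le \frac{K}{c_1^2 n\log^4 n} \le \frac{1}{c_1^2\log^4 n}$. So the entries of $W_1$ are bounded by $\frac{1}{16\log^2 n}$.

The entries of $W_2$ are $(1+\epsilon)c_A$, $-(1+\epsilon)c_{A^c}$, $(1+\epsilon)\frac{1-p_{ij}}{p_{ij}}c_{A^c}$, or $-(1+\epsilon)\frac{q_{ij}}{1-q_{ij}}c_A$. Their magnitude is bounded by $\max\{\frac{2}{p}c_{A^c}, 2c_A\} \le \frac{1}{16\log^2 n}$.

The variance of the entries of $W_1$ is $(1-p_{ij})/(p_{ij}k_m^2)$. Since $k_m \ge K$ and $p_{ij} \ge p \ge c_1^2 n\log^4 n/K^2$, the variance is upper bounded by $\frac{1}{c_1^2 n\log^4 n}$, and further upper bounded by $\frac{1}{256\,n\log n}$ as $n \ge 4$.

The variance of the entries of $W_2$ is either $\frac{1-p_{ij}}{p_{ij}}c_{A^c}^2$ or $\frac{q_{ij}}{1-q_{ij}}c_A^2$. Note that $\frac{1-p_{ij}}{p_{ij}}c_{A^c}^2 \le \frac{1-p}{p}\cdot\frac{p}{1-p}\cdot\frac{1}{256\,n\log n} = \frac{1}{256\,n\log n}$ and $\frac{q_{ij}}{1-q_{ij}}c_A^2 \le \frac{q}{1-q}\cdot\frac{1-q}{q}\cdot\frac{1}{256\,n\log n} = \frac{1}{256\,n\log n}$. This completes the proof of the lemma. □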
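The zero-mean property and the variance formula for the entries of $W_1$ can be checked by a direct two-point computation. The sketch below assumes (this assignment of probabilities is our reading, not stated explicitly in the proof) that an off-diagonal in-cluster entry takes the value $-1/k_m$ when the edge is missing, which happens with probability $1-p_{ij}$, and the value $(1-p_{ij})/(p_{ij}k_m)$ when the edge is present, with probability $p_{ij}$. The symbol $X$ is a hypothetical name for such an entry.

```latex
% X: an off-diagonal entry of W_1 inside cluster m (the name X is ours)
\begin{align*}
\mathbb{E}[X] &= (1-p_{ij})\Big(-\frac{1}{k_m}\Big) + p_{ij}\,\frac{1-p_{ij}}{p_{ij}k_m} = 0,\\
\operatorname{Var}(X) = \mathbb{E}[X^2]
  &= \frac{1-p_{ij}}{k_m^2} + p_{ij}\,\frac{(1-p_{ij})^2}{p_{ij}^2k_m^2}
   = \frac{1-p_{ij}}{k_m^2}\Big(1 + \frac{1-p_{ij}}{p_{ij}}\Big)
   = \frac{1-p_{ij}}{p_{ij}k_m^2}.
\end{align*}
```

Under that assumption, the last expression agrees with the variance of the entries of $W_1$ used in the proof of Lemma 2.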

We also need the following simple lemma.

Lemma 3. Under the assumption of Theorem 1, where $c_1 \ge 16$, we have

$$
(1+\epsilon)\frac{1-p}{p}\,c_{A^c} \le (1-2\epsilon)\,c_A,
$$

$$
(1+\epsilon)\frac{q}{1-q}\,c_A \le (1-\epsilon)\,c_{A^c}.
$$

Proof. If ^ < i2sl^, then ca = /, ,f^. In this case, (1 + e)'^^^^ < 2^,/, ./M < 



1 1 



i ieVniogn ^ (1 - since g < 

If > 1, then CAc = }. ; we also have e < ^^°g "-v^ < 1 since K > ci-v/nloe^ n. In this 

l—p — ' 16vnlogn' — K — o _±vo 

case, (1 + er^^ < f,^^ < since q < i Similarly, we have (1 + < 1,^7^7^ ^ 
i ievniogn ^ (1 - 2e)c.4= since g < i. 

lo n 

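In the remaining case, it is easy to verify that both inequalities in the lemma are implied by

$$
(1-2\epsilon)^2\,p(1-q) \ge (1+2\epsilon)^2\,q(1-p).
$$

By the assumption of Theorem 1 we have $p - q \ge c_1\frac{\sqrt{pn}}{K}\log^2 n = \frac{c_1}{2}\,p\epsilon \ge 8p\epsilon$ when $c_1 \ge 16$. Notice that we have $4\epsilon \ge 4\epsilon/(1+4\epsilon^2)$ by $\epsilon > 0$, and because $p > q$, we have $2p > (p+q)$. Multiplying the two inequalities, we have

$$
p - q \ge 8p\epsilon \ge \frac{4\epsilon}{1+4\epsilon^2}(p+q),
$$

which implies $(1-2\epsilon)^2p - (1+2\epsilon)^2q \ge 0 \ge (1-2\epsilon)^2pq - (1+2\epsilon)^2pq$. Notice that $\epsilon < \frac{1}{4}$ by Lemma 2. The desired inequality follows easily. □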
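The final algebraic step can be spot-checked numerically. The sketch below is only a sanity check under stated assumptions, not part of the proof: it samples hypothetical values of $p$, $q$ and $\epsilon$ with $p - q \ge 8p\epsilon$ and $0 < \epsilon < 1/8$ (the sampling ranges and the helper name `check_implication` are our own choices) and counts how often the two-sided inequality $(1-2\epsilon)^2p - (1+2\epsilon)^2q \ge 0 \ge (1-2\epsilon)^2pq - (1+2\epsilon)^2pq$ fails.

```python
import random


def check_implication(trials=100_000, seed=0):
    """Count sampled (p, q, eps) with p - q >= 8*p*eps that violate
    (1-2e)^2 p - (1+2e)^2 q >= 0 >= (1-2e)^2 pq - (1+2e)^2 pq."""
    rng = random.Random(seed)
    violations = 0
    for _ in range(trials):
        p = rng.uniform(0.01, 1.0)
        eps = rng.uniform(0.001, 0.124)
        # Any q in [0, p * (1 - 8 * eps)] satisfies p - q >= 8 * p * eps.
        q = rng.uniform(0.0, p * (1.0 - 8.0 * eps))
        left = (1 - 2 * eps) ** 2 * p - (1 + 2 * eps) ** 2 * q
        right = (1 - 2 * eps) ** 2 * p * q - (1 + 2 * eps) ** 2 * p * q
        if not (left >= 0.0 >= right):
            violations += 1
    return violations


violations = check_implication()
print("violations:", violations)
```

A random search of this kind cannot replace the argument above; it only guards against sign or exponent slips when the inequality is rearranged.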

Now we are ready to proceed with the proof of Proposition 3, which can be divided into 4 steps, corresponding to checking each of the 4 conditions in Proposition 2.

(1) Bounding $\|P_{T^\perp}W\|$.

The matrix $W_3 = (1+\epsilon)c_AP_\Omega(I)$ is diagonal, and thus satisfies

$$
\|W_3\| = (1+\epsilon)c_A \le 2\cdot\frac{1}{16\sqrt{n\log n}}\min\Big\{\sqrt{\frac{1-q}{q}},\ \sqrt{\frac{n}{\log^4 n}}\Big\} \le \frac{1}{4}.
$$

Recall that $W_1$ and $W_2$ are random matrices with independent zero-mean entries having bounded magnitude and variance. We apply standard results on the spectral norm of random matrices (Lemma 4 in the Appendix) to obtain

$$
\|P_{T^\perp}(W)\| \le \|W_1\| + \|W_2\| + \|W_3\| \le 12\max\Big\{\frac{1}{16\log^2 n}\log^2 n,\ \frac{1}{16\sqrt{n\log n}}\sqrt{n\log n}\Big\} + \frac{1}{4} \le 1.
$$



(2) Bounding $\|P_TW\|_\infty$.

Recall that $P_TW_i = U_0U_0^\top W_i + W_iU_0U_0^\top - U_0U_0^\top W_iU_0U_0^\top$. For $i = 3$, because $U_0U_0^\top$ is a block diagonal matrix supported on $R$, whereas $W_3$ is diagonal and supported on $R^c$, it is easy to see that $U_0U_0^\top W_3 = W_3U_0U_0^\top = 0$. Thus we have $P_TW_3 = 0$. For $i = 1, 2$, we bound each of the three terms in $P_TW_i$.
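The projection formula and the vanishing of the $W_3$ term can be illustrated on a toy instance. The following is a minimal sketch in plain Python, assuming a hypothetical cluster assignment with two clusters and one isolated node; the helper names `matmul`, `cluster_projector` and `project_T` are ours and do not come from the paper. It builds $U_0U_0^\top$ as the block matrix with entries $1/k_m$, applies $P_T$, and checks that $P_T$ is idempotent and that a diagonal matrix supported on an isolated node is mapped to zero.

```python
import random


def matmul(A, B):
    """Plain-Python product of two square matrices (lists of rows)."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def cluster_projector(labels):
    """Return U0 U0^T for a cluster assignment.

    labels[i] is the cluster index of node i, or None for an isolated node.
    Entry (i, j) equals 1/k_m when i and j both lie in cluster m, else 0.
    """
    n = len(labels)
    sizes = {}
    for lab in labels:
        if lab is not None:
            sizes[lab] = sizes.get(lab, 0) + 1
    return [[1.0 / sizes[labels[i]]
             if labels[i] is not None and labels[i] == labels[j] else 0.0
             for j in range(n)] for i in range(n)]


def project_T(P, W):
    """P_T(W) = P W + W P - P W P, where P = U0 U0^T."""
    PW, WP = matmul(P, W), matmul(W, P)
    PWP = matmul(PW, P)
    n = len(W)
    return [[PW[i][j] + WP[i][j] - PWP[i][j] for j in range(n)]
            for i in range(n)]


# Hypothetical toy instance: clusters of sizes 3 and 2, plus one isolated node.
labels = [0, 0, 0, 1, 1, None]
n = len(labels)
P = cluster_projector(labels)

rng = random.Random(0)
W = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]

# A projection is idempotent: applying P_T twice changes nothing.
PT_W = project_T(P, W)
PT_PT_W = project_T(P, PT_W)
max_gap = max(abs(PT_W[i][j] - PT_PT_W[i][j])
              for i in range(n) for j in range(n))

# A diagonal matrix supported only on the isolated node is annihilated by P_T.
W3 = [[0.0] * n for _ in range(n)]
W3[5][5] = 0.3
PT_W3 = project_T(P, W3)

print("max |P_T(W) - P_T(P_T(W))| =", max_gap)
```

The second check mirrors the argument for $i = 3$: the row and column of $U_0U_0^\top$ belonging to an isolated node are zero, so both products with the diagonal matrix vanish.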

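Since $W_i$ is a random matrix and $U_0U_0^\top W_i = \big(\sum_m \frac{1}{k_m}\mathbf{1}_{R_m}\big)W_i$, each entry of $U_0U_0^\top W_i$ equals $\frac{1}{k_m}$ times the sum of $k_m$ independent zero-mean random variables, whose magnitude and variance are bounded as previously discussed. The standard Bernstein inequality (Lemma 5 in the Appendix) yields w.h.p.

$$
\|U_0U_0^\top W_i\|_\infty \le \frac{1}{K}\,c_3\max\Big\{\frac{1}{16\log^2 n}\log n,\ \frac{1}{16\sqrt{n}}\sqrt{K\log n}\Big\} \le c_3\max\Big\{\frac{1}{16K\log n},\ \frac{\sqrt{\log n}}{16\sqrt{nK}}\Big\} \le \frac{\log n}{96K},
$$

where the last inequality holds for $n$ large enough.

By an almost identical argument, we have $\|W_iU_0U_0^\top\|_\infty \le \frac{\log n}{96K}$. Furthermore,

$$
U_0U_0^\top W_iU_0U_0^\top = \Big(\sum_m \frac{1}{k_m}\mathbf{1}_{R_m}\Big)\big[W_iU_0U_0^\top\big],
$$

which implies that

$$
\|U_0U_0^\top W_iU_0U_0^\top\|_\infty \le \|W_iU_0U_0^\top\|_\infty \le \frac{\log n}{96K}.
$$

Thus, we have

$$
\|P_TW\|_\infty \le \|P_TW_1\|_\infty + \|P_TW_2\|_\infty \le \frac{\log n}{16K}.
$$

On the other hand, from the definitions of $c_A$, $c_{A^c}$ and $\epsilon$ one checks that $\frac{1}{2}\epsilon\min\{c_A, c_{A^c}\} \ge \frac{\log n}{16K}$. We conclude that $\|P_TW\|_\infty \le \frac{\epsilon}{2}\min\{c_A, c_{A^c}\}$.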
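(3) Computing $\langle P_\Omega(U_0U_0^\top + W), \Delta\rangle$.

Recall that $\langle C\circ S^*, \Delta\rangle = \|P_\Omega(C\circ\Delta)\|_1$ for $\Delta\in\mathcal{D}$. Using the definition of $W$, we obtain

$$
\langle P_\Omega(U_0U_0^\top) + P_\Omega(W), P_\Omega\Delta\rangle = \langle (1+\epsilon)\,C\circ S^*, P_\Omega\Delta\rangle = (1+\epsilon)\,\|P_\Omega(C\circ\Delta)\|_1.
$$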



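(4) Bounding $\langle P_{\Omega^c}(U_0U_0^\top + W), \Delta\rangle$.

Observe that

$$
\begin{aligned}
&\langle P_{R\cap\Omega^c}(U_0U_0^\top + W), \Delta\rangle \\
&\quad= \Big\langle \sum_{m=1}^{r}\frac{1}{k_m}\mathbf{1}_{R_m\cap\Omega^c} + \sum_{m=1}^{r}\ \sum_{(i,j)\in(R_m\cap E^c)\cap\Omega^c}\frac{1-p_{ij}}{p_{ij}k_m}\,e_ie_j^\top + (1+\epsilon)\,c_{A^c}\sum_{(i,j)\in(R\cap E^c)\cap\Omega^c}\frac{1-p_{ij}}{p_{ij}}\,e_ie_j^\top,\ \Delta\Big\rangle \\
&\quad\overset{(a)}{\ge} -\Big(\epsilon c_A + (1+\epsilon)\frac{1-p}{p}\,c_{A^c}\Big)\|P_{R\cap\Omega^c}(\Delta)\|_1 \\
&\quad\overset{(b)}{\ge} -\big(\epsilon c_A + (1-2\epsilon)c_A\big)\|P_{R\cap\Omega^c}(\Delta)\|_1 \\
&\quad\overset{(c)}{=} -(1-\epsilon)\,\|P_{R\cap\Omega^c}(C\circ\Delta)\|_1,
\end{aligned}
$$

where in (a) we use the fact that when $n$ is large enough and $p \ge q$, $p \ge c_1^2\frac{n\log^4 n}{K^2}$, we have

$$
\epsilon c_A = \frac{2\log^2 n}{K}\sqrt{\frac{n}{p}}\cdot\frac{1}{16\sqrt{n\log n}}\min\Big\{\sqrt{\frac{1-q}{q}},\ \sqrt{\frac{n}{\log^4 n}}\Big\} = \frac{1}{8pK}\cdot\sqrt{p}\,\log^{3/2} n\cdot\min\Big\{\sqrt{\frac{1-q}{q}},\ \sqrt{\frac{n}{\log^4 n}}\Big\} \ge \frac{1}{pK};
$$

also (b) follows from Lemma 3, and (c) holds since $C_{ij} = c_A$ for $(i,j)\in R\cap\Omega^c$. Similarly, we have

$$
\begin{aligned}
\langle P_{R^c\cap\Omega^c}W, \Delta\rangle
&= \Big\langle -(1+\epsilon)\,c_A\sum_{(i,j)\in R^c\cap\Omega^c}\frac{q_{ij}}{1-q_{ij}}\,e_ie_j^\top,\ \Delta\Big\rangle \\
&\ge -(1+\epsilon)\frac{q}{1-q}\,c_A\,\|P_{R^c\cap\Omega^c}(\Delta)\|_1 \\
&\ge -(1-\epsilon)\,c_{A^c}\,\|P_{R^c\cap\Omega^c}(\Delta)\|_1 \\
&= -(1-\epsilon)\,\|P_{R^c\cap\Omega^c}(C\circ\Delta)\|_1,
\end{aligned}
$$

where we use Lemma 3 in the last inequality. Combining pieces, we conclude that

$$
\langle P_{\Omega^c}(U_0U_0^\top + W), \Delta\rangle \ge -(1-\epsilon)\,\|P_{\Omega^c}(C\circ\Delta)\|_1.
$$

This completes the proof of Proposition 3. Combining with Proposition 2, this proves Theorem 1.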

7 Conclusion 

We presented a convex optimization formulation, essentially a weighted version of low-rank matrix decomposition, to address graph clustering where the graph is sparse. We showed that under a wide range of problem parameters, the proposed method is guaranteed to recover the correct clustering. In fact, our theoretical analysis shows that the proposed method outperforms every existing method we are aware of in this setting, i.e., it succeeds under less restrictive conditions. Simulation studies also validate the efficiency and effectiveness of the proposed method.

This work is motivated by the analysis of large-scale social networks, where, inherently, even actors (nodes) within one cluster are more likely than not to be unconnected. As such, immediate goals for future work include faster algorithm implementations, as well as effective post-processing schemes (e.g., rounding) for when the obtained solution is not an exact cluster matrix.
Appendices



A The spectral norm of random matrices 



It is well known that the spectral norm $\lambda_1(A)$ of a zero-mean random matrix $A$ is bounded above w.h.p. by $C\sqrt{n}$, where $C$ is a constant that might depend on the variance and magnitude of the entries of $A$. Here we state and (re-)prove an upper bound on $\lambda_1(A)$ with an explicit estimate of the constant $C$, which is needed in the proof of the main theorem.

Lemma 4. Let $A_{ij}$, $1 \le i, j \le n$, be independent random variables, each of which has mean $0$ and variance at most $\sigma^2$ and is bounded in absolute value by $B$. Then with probability at least $1 - 2n^{-2}$,

$$
\lambda_1(A) \le 6\max\big\{\sigma\sqrt{n\log n},\ B\log^2 n\big\}.
$$

Proof. Let $e_i$ be the $i$-th standard basis vector in $\mathbb{R}^n$. Let $Z_{ij} = A_{ij}e_ie_j^\top$. Then the $Z_{ij}$'s are zero-mean random matrices independent of each other, and $A = \sum_{i,j} Z_{ij}$. We have $\|Z_{ij}\| \le B$ almost surely. We also have $\big\|\sum_{i,j}\mathbb{E}(Z_{ij}Z_{ij}^\top)\big\| = \big\|\sum_i e_ie_i^\top\sum_j\mathbb{E}(A_{ij}^2)\big\| \le n\sigma^2$. Similarly $\big\|\sum_{i,j}\mathbb{E}(Z_{ij}^\top Z_{ij})\big\| \le n\sigma^2$. Applying the Non-commutative Bernstein Inequality (Theorem 1.6 in [30]) with $t = 6\max\{\sigma\sqrt{n\log n}, B\log^2 n\}$ yields the desired bound. □
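The bound of Lemma 4 can be compared against a simulated matrix. The following is a minimal sketch in plain Python, under assumptions of our own: the entries are centered Bernoulli($p$) variables with hypothetical values $n = 200$ and $p = 0.1$, and the spectral norm is estimated by power iteration, which approaches the true value from below. The function names are illustrative and not from the paper.

```python
import math
import random


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in M]


def spectral_norm_estimate(A, iters=60, seed=1):
    """Power iteration on A^T A; the returned value never exceeds the true
    spectral norm and approaches it as iters grows."""
    At = [list(col) for col in zip(*A)]
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(len(A))]
    for _ in range(iters):
        w = matvec(At, matvec(A, v))
        scale = math.sqrt(sum(x * x for x in w))
        v = [x / scale for x in w]
    Av = matvec(A, v)
    return math.sqrt(sum(x * x for x in Av))


def lemma4_bound(sigma, B, n):
    """Right-hand side 6 * max{sigma * sqrt(n log n), B * log^2 n}."""
    return 6.0 * max(sigma * math.sqrt(n * math.log(n)),
                     B * math.log(n) ** 2)


n, p = 200, 0.1                      # assumed toy values
rng = random.Random(0)
# Centered Bernoulli(p) entries: mean 0, variance p(1-p), |entry| <= 1-p.
A = [[(1.0 - p) if rng.random() < p else -p for _ in range(n)]
     for _ in range(n)]

sigma, B = math.sqrt(p * (1.0 - p)), 1.0 - p
estimate = spectral_norm_estimate(A)
bound = lemma4_bound(sigma, B, n)
print(f"estimate {estimate:.2f}  vs  bound {bound:.2f}")
```

At this small size the bound is loose; the sketch only shows how the quantities $\sigma$, $B$ and $n$ enter the right-hand side of the lemma.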



B Standard Bernstein Inequality for Sum of Independent Variables 



Lemma 5. ([31], Proposition 5.16) Let $Y_1, \ldots, Y_N$ be independent random variables, each of which has variance bounded by $\sigma^2$ and is bounded in absolute value by $B$ a.s. Then we have

$$
\Big|\sum_{i=1}^{N} Y_i - \mathbb{E}\Big[\sum_{i=1}^{N} Y_i\Big]\Big| \le c_0\max\big\{\sigma\sqrt{N\log n},\ B\log n\big\}
$$

with probability at least $1 - c_1n^{-c_2}$, where the positive constants $c_0$, $c_1$, $c_2$ are independent of $\sigma$, $B$, $N$ and $n$.
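The shape of the deviation level in Lemma 5 can be explored empirically. The sketch below is a simulation under assumptions of our own, not a verification of the lemma: the summands are centered Bernoulli($p$) variables, the values $N = 200$, $p = 0.05$, $n = 1000$ are hypothetical, and $c_0 = 2$ is an assumed constant (the lemma only asserts that some constant exists). The function names are illustrative.

```python
import math
import random


def bernstein_radius(sigma, B, N, n, c0):
    """c0 * max{sigma * sqrt(N log n), B * log n}, the deviation level in Lemma 5."""
    return c0 * max(sigma * math.sqrt(N * math.log(n)), B * math.log(n))


def fraction_within(N, p, n, c0, trials=2000, seed=0):
    """Fraction of trials in which a sum of N centered Bernoulli(p) variables
    stays within the Bernstein radius."""
    rng = random.Random(seed)
    sigma = math.sqrt(p * (1.0 - p))   # standard deviation of one summand
    B = max(p, 1.0 - p)                # almost-sure bound on one summand
    radius = bernstein_radius(sigma, B, N, n, c0)
    hits = 0
    for _ in range(trials):
        total = sum((1.0 - p) if rng.random() < p else -p for _ in range(N))
        if abs(total) <= radius:
            hits += 1
    return hits / trials


# c0 = 2 is an assumed value; the lemma only asserts that some constant exists.
frac = fraction_within(N=200, p=0.05, n=1000, c0=2.0)
print("fraction within radius:", frac)
```

Note that $n$ enters only through $\log n$: it sets the confidence level, while $N$ is the number of summands, which matches how the lemma is applied with $N = k_m$ in the bound on $\|U_0U_0^\top W_i\|_\infty$.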